A brief intro to thinking deeply about visualization and R code to do so.
By the end of this you will have had a whirlwind tour of the very tip of the data visualization best-practices iceberg. We will go over a broad range of topics generally applicable to data science use cases but not dive too deep into any single one. One thing to keep in mind the whole time is that none of this is set in stone; in the real world you often have to bend or break some of these rules to do what you want.
“The best camera is the one that’s with you.” –Chase Jarvis.
A very common question for people starting out with R and visualization is “which library should I use?” Like most things there is no right answer. Every situation is different. These are a few points to keep in mind in deciding which tool to use (hint: it really doesn’t matter).
Jeff Leek has a fantastic article on his blog about this issue.
Ultimately it comes down to what you know. You can do an absolutely amazing amount in most tools (even Excel) so do what you like best.
For most people in R the choice is ggplot vs base. I mostly use ggplot because it’s what I am the most familiar with, and it has nice defaults (more on this later).
Whatever you choose will, in the not too distant future, be old and replaced by the new best thing, so understanding the concepts is a much better investment of your time. The next bit of this will be trying to reinforce good concepts.
A lot of data visualization is common sense, but some of it isn’t. These are a few of the examples of charts made that are not the best fit for the data that I frequently see.
Okay, let’s address the elephant in the room first. The pie chart elicits a similar response in a data-viz person as a computer scientist’s prediction algorithm does in a statistician: initial claims of blasphemy but sometimes, upon closer inspection, grudging respect.
# a simple pie chart
data = data.frame(
val = c( 8 , 6 , 9 , 4 , 2 , 3.5),
labs = c("a", "b", "c", "d", "e", "f") )
pie(data$val, data$labs)
So why all the ire?
Humans have a very hard time interpreting angles, and that’s how a pie chart encodes the data. Looking at the code/chart above we know that d and f are 0.5 apart, or f is only 87.5% of the value of d, but upon initial inspection the average user would probably say they are the same.
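To put numbers on this, here is a small sketch that converts the values from the pie chart above into slice angles. The variable names `angles` and `angle_gap` are mine and not part of the original example.

```r
# Same values as the pie chart above
val <- c(a = 8, b = 6, c = 9, d = 4, e = 2, f = 3.5)

# A pie chart gives each slice its share of the full 360 degrees
angles <- val / sum(val) * 360
round(angles[c("d", "f")], 1)

# The difference the reader has to spot, in degrees
angle_gap <- unname(angles["d"] - angles["f"])
round(angle_gap, 1)
```

Slices d and f differ by only about five and a half degrees, which is a small difference to judge by eye.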
We could use something called a tree map:
library(treemap)
treemap(data, c("labs"), "val")
This is similar in spirit to a pie chart, but encodes values in physical area rather than using an angle.
I would argue this is actually worse than the pie chart, but it is certainly a good option for some types of data. If you had a large number of values or hierarchically clustered data, treemaps can be excellent tools for looking at large amounts of data fast.
Even simpler you could do a stacked bar chart.
library(ggplot2)
ggplot(data, aes(1, val, fill=labs, width=0.2)) +
geom_col()
Same concept as the treemap in that value is encoded in area rather than angle. This would be good for a low number of comparisons with logical ordering or as a supplementary figure for a larger visualization.
Out of all of these options a plain bar chart is probably the most clear.
ggplot(data, aes(y = val, x = labs)) +
geom_col() +
labs(x = "")
Using a bar chart we can clearly see that f is smaller than d.
There was a paper a few years ago by two superstars in data visualization, Jeffrey Heer and Mike Bostock. In it they took a bunch of visual encodings of the same data (much like we are doing here), showed them to people, and asked them questions about what the data said. They then recorded these results and plotted them to show differences between encoding quality.
Pie charts are pretty far down there, but then again so are tree maps. If you did it for a different dataset I am betting you would get different results given which chart type the data fits best with. This raises the important question:
Is all the hate warranted for pie charts?
Penn postdoc Randal Olson has a good blog post on pie charts. It is a highly recommended read, but to paraphrase his rules on pie charts:
People intuitively get pie charts so don’t rule out their use entirely, but make sure you are using them properly.
Bar charts are fantastic tools. It seems that more often than not they are the best visualization for the job, often out-competing more complicated flashy visualizations in terms of ease of reading/comprehension. There are some instances where they are not appropriate however.
As a general rule of thumb, if the measure is a quantity of something then it makes sense to use a bar chart. This would include number of infections, a person’s weight, etc. A general heuristic I like to use when deciding to use a bar chart or not is ‘could I redraw the chart such that the bars are made up of individual instances of whatever the y-axis is encoding?’
Let’s look at a group of students and their percentiles for vitamin D levels in their blood.
First we plot with a bar plot.
data <- data.frame(student = c("Tina", "Trish", "Kevin", "Rebecca", "Sarah"),
percentile = c(25, 95, 54, 70, 99) ) #percentile of d levels
p <- ggplot(data, aes(x = student, y = percentile))
p + geom_col()
The hierarchy of the data is clearly visible but the intuitive interpretation of the bar is slightly confusing. A percentile is not a sum of values but simply a place on the continuum of a scale. In addition, we have a tendency to assign good or bad to large or small levels of bar charts when in this case the middle would be best.
Let’s re-visualize the data as a dot-plot.
p +
geom_point(color = "steelblue", size = 4) +
theme_minimal() + # helps make the grid lines look more like guides
coord_flip()
This is more legible and intuitive. We see that the measure is simply a point where the student falls, not the accumulation of percentiles.
There are some exceptions to this rule. For instance: weight being looked at for a single person over time might be best shown on a line chart. Like almost everything in visualization, thinking carefully about what your data are before plotting them is important.
What if the thing we’re interested in is showing multiple observations of a given value? For instance, say we were looking at the expression of a protein of interest across different conditions. Like good scientists we took multiple measurements of each condition so we should represent that.
I will forgive you if you want to use dynamite plots just because they have the coolest name; however, don’t.
To illustrate why let’s look at the dynamite plot of our hypothetical expression experiment…
p <- ggplot(expression_summarized,aes(x = sample, y = average)) +
geom_errorbar(aes(ymin = average + sd, ymax = average + sd)) +
geom_linerange(aes(ymin = average, ymax = average + sd))
p + geom_col()
All these look pretty similar! Must be nothing really interesting going on. Maybe the first two conditions have higher peaks? Or do the bars represent variability? Also, does the top of the bar represent the bottom of an interval? Or the middle? I can never remember (because it changes from plot to plot).
Let’s actually look at the data that went into these plots. Usually with plots like these the number of datapoints going in is tiny so we may as well just show it.
p +
geom_col(alpha = 0.5) +
geom_jitter(data = expressions, aes(x = sample, y = expression), width = 0.1)
Oh, oh my, those aren’t the same at all. Let’s just clean this up a tiny bit.
ggplot(expression_summarized,aes(x = sample, y = average)) +
geom_pointrange(aes(ymin = average - sd, ymax = average + sd),
shape = 1, color = 'steelblue') +
geom_jitter(data = expressions, aes(x = sample, y = expression),
width = 0.1, alpha = 0.5)
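The `expressions` and `expression_summarized` data frames used in these plots are not defined anywhere in this post. As a stand-in, here is a minimal sketch of how two data frames with the same column names could be built. The conditions, values, and seed are all made up for illustration and are not the data behind the figures.

```r
set.seed(42)  # arbitrary seed so the made-up data is reproducible

# Raw measurements: six replicates for each of four hypothetical conditions
expressions <- data.frame(
  sample = rep(c("A", "B", "C", "D"), each = 6),
  expression = c(
    rnorm(6, mean = 10, sd = 1),                   # tight cluster
    c(rnorm(5, mean = 8, sd = 0.5), 20),           # one extreme outlier
    rep(c(7, 13), each = 3) + rnorm(6, sd = 0.2),  # two separate clumps
    rnorm(6, mean = 10, sd = 3)                    # wide spread
  )
)

# One row per condition holding the mean and standard deviation
expression_summarized <- data.frame(
  sample  = sort(unique(expressions$sample)),
  average = as.numeric(tapply(expressions$expression, expressions$sample, mean)),
  sd      = as.numeric(tapply(expressions$expression, expressions$sample, sd))
)
expression_summarized
```

The summary table is all a dynamite plot ever sees, which is why very different sets of raw points can end up looking alike.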
If you want more evidence to push back against your PI with than the word of some random biostats grad student, the blog post “Dynamite Plots Must Die” by Rafael Irizarry, professor of biostatistics at Harvard, has much more in-depth coverage of why dynamite plots are bad.
Box plots are, like the pie chart, one of the first visualization techniques we are taught. However, they are not necessarily a good one, and many better options have since arisen.
The problem with box plots, much like dynamite plots, is they obscure trends at a resolution finer than the quantiles. Take for instance the following two box plots:
#Hiding the data input on purpose...
p <- ggplot(data, aes(dataset, val))
p +
geom_boxplot(fill = "steelblue", color = "grey") +
labs(title = "Box Plots")
Given the information that the standard box plot provides us we would say that these groups are identical.
What happens if we try another way of visualizing the distribution of data?
First let’s try a(nother form of the) dot plot:
p +
geom_dotplot(binaxis = "y", stackdir = "center",
fill = "steelblue", color = "steelblue", dotsize = 0.7) +
labs(title = "Dot Plots")
Now we can see that these data are very differently distributed.
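The data behind these plots is deliberately hidden in this post. As a separate illustration of how this can happen, the sketch below builds two made-up vectors (`spread_out` and `clumped`, both names and values are mine) whose quantiles agree exactly even though their shapes differ.

```r
# Evenly spread values from 0 to 100
spread_out <- 0:100

# Same length, but almost every value sits in one of three clumps
clumped <- c(0, rep(25, 33), rep(50, 33), rep(75, 33), 100)

# The five numbers a box plot draws (min, quartiles, max) match exactly
quantile(spread_out)
quantile(clumped)

# Yet the number of distinct values is very different
length(unique(spread_out))
length(unique(clumped))
```

A box plot of either vector would be drawn from the same five numbers, so the clumping is invisible until the points themselves are shown.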
Another method of visualizing the distribution of the groups is a violin plot. This is essentially a kernel density version of the dot plot. It is useful when the data are so large that a dot plot would draw too many dots to read. However, if your data are small enough that you can actually visualize each point, do it.
p +
geom_violin(adjust = .5, fill = "steelblue", color = "steelblue") +
labs(title = "Violin Plots")
If you still want the familiarity of the box plot combined with the enhanced ability to see the underlying distribution you can combine the two plots as well.
p +
geom_dotplot(binaxis = "y", stackdir = "center",
fill = "steelblue", color = "steelblue") +
geom_boxplot(alpha = 0, size=1)
Now we get the standard and familiar inference of quantiles, combined with seeing the finer resolution information about the distribution.
Lastly we have the granddaddy of distribution visualizations, the histogram.
ggplot(data, aes(val)) +
geom_histogram(bins = 40) +
facet_wrap(~dataset)
That tells the story pretty well, too.
Nothing could be wrong with the simple, well-meaning histogram, right? Wrong. It is entirely possible to fall into the same issues as the box plots with histograms. If your bins are aligned awkwardly with your data, two histograms of the same data can look entirely different.
Look what happens if we switch up the bin number on our data.
ggplot(data, aes(val)) +
geom_histogram(bins = 10) +
facet_wrap(~dataset)
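The same effect can be reproduced on a tiny scale with base R. The sketch below uses six made-up values and two sets of bin edges of identical width; the vector `x` and the two `counts_*` names are mine.

```r
# Six made-up values forming three tight pairs
x <- c(1.9, 2.1, 3.9, 4.1, 5.9, 6.1)

# Same bin width (2), but the bin edges are shifted by one unit
counts_even <- hist(x, breaks = seq(0, 8, by = 2), plot = FALSE)$counts
counts_odd  <- hist(x, breaks = seq(1, 7, by = 2), plot = FALSE)$counts

counts_even  # bins (0,2], (2,4], (4,6], (6,8]
counts_odd   # bins (1,3], (3,5], (5,7]
```

The first set of edges splits every pair across two bins and suggests a single hump; the second keeps each pair together and looks flat. Neither hints at the gaps between the pairs.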
Here’s a brief animation of the histogram bin widths slowly changing while the underlying dataset doesn’t change. Be careful!
Read this article by statistician turned data-visualization expert Nathan Yau on plotting distribution data for a much more thorough treatment of this issue.
Personal Plug: Use a sliding histogram (patent/trademark pending) to get rid of problems with binning but also keep the interpretability that you lose with a kernel density plot. See my interactive demo.
First we draw a traditional word cloud of Bertrand Russell’s “An Essay on the Foundations of Geometry”.
library(tm)
library(SnowballC)
library(wordcloud)
Russell_Geom <- readChar("data/Russell_Geometry.txt", file.info("data/Russell_Geometry.txt")$size)
text_corpus <- Corpus(VectorSource(Russell_Geom)) #Generate a corpus
text_corpus <- tm_map(text_corpus, content_transformer(tolower))
text_corpus <- tm_map(text_corpus, removePunctuation) #remove punctuation
#Remove commonly used words that don't add meaning. (e.g. I, Me)
text_corpus <- tm_map(text_corpus, removeWords, stopwords('english'))
wordcloud(text_corpus, max.words = 40, random.order = FALSE)
Ahh clearly we can grab very important information on the frequency of the words in this book…
Is “point” or “space” bigger? “Geometry” and “axiom”? Basically it’s impossible to tell.
Now let’s do it in a bar chart.
library(dplyr) # for %>% and arrange()
library(tibble) # for tibble()
#count how often each word appears in the corpus
freq_df <- DocumentTermMatrix(text_corpus) %>%
as.matrix() %>%
colSums() %>% {tibble(
word = names(.),
frequency = .
)} %>%
arrange(-frequency)
#reorder() sorts the bars by frequency; otherwise ggplot orders the words alphabetically
ggplot(head(freq_df, 40), aes(x = reorder(word, frequency), y = frequency)) +
geom_bar(stat = "identity") + labs(x = "Word") + #use a barchart and label the xaxis
coord_flip()
So while the bar chart might not be as flashy and cool, it certainly conveys the information that you are trying to show more accurately.
That being said, if you are trying to simply make eye candy then go for the word cloud. However, if you are attempting to facilitate meaningful analysis stick to a bar-chart.
Manipulating axes is potentially one of the most damaging data visualization mistakes. By truncating an axis you can entirely change the interpretation of a chart. You can exaggerate a difference or minimize it. A good example of this done with potentially dangerous side effects is a tweet sent out by the magazine National Review.
Look at that, we’ve all been getting way too worried about climate change! But wait, looks like they started their y-axis at 0. Seems like a good idea until you realize that 0 Fahrenheit means absolutely nothing. If you’re going to start a temperature at 0 you might as well go all the way and do Kelvin.
Let’s see an example of where truncating the axis is bad.
library(tibble) # for tibble()
data <- tibble(
"date" = c(2010, 2011, 2012, 2013),
"deaths"= c(400, 402, 408, 412) )
p <- ggplot(data, aes(x = date, y = deaths)) +
geom_line() +
theme_bw() +
labs(title = "Hospital Deaths from 2010-2013")
p
Oh my, looks like we’ve had a massive spike in hospital deaths.
Deaths however, are a measurement that has a meaningful start point (zero). So let’s try and fix our axis scale to represent that.
p + ylim(0,450)
Turns out that was a false alarm (although still 12 more deaths might not be trivial).
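A quick back-of-the-envelope check makes the difference between the two charts concrete. This sketch reuses the `deaths` values from above; the other variable names are mine, and it ignores the small padding ggplot adds around the data.

```r
deaths <- c(400, 402, 408, 412)

# Relative change from the first to the last year, in percent
pct_change <- (deaths[length(deaths)] - deaths[1]) / deaths[1] * 100
pct_change

# Fraction of the y-axis height the line climbs under each choice of limits
rise <- diff(range(deaths))
share_truncated  <- rise / diff(range(deaths))  # axis fitted tightly to the data
share_zero_based <- rise / (450 - 0)            # axis set with ylim(0, 450)
c(truncated = share_truncated, zero_based = share_zero_based)
```

The same 3% increase fills the whole plot height on the fitted axis and under 3% of it on the zero-based one.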
Important point: ggplot automatically truncated the axis in this case. In a bar chart it won’t let you set a non-zero axis without some esoteric scale commands, but for many other plots (such as points and lines) it automatically truncates the axis so your data just fits in the limits. Be vigilant about this.
Let’s continue with our morbid theme by looking at Nicolas Cage movies. Tyler Vigen’s excellent site on spurious correlations illustrates our next point very well: when you make a chart with two different axes you can basically make the data say anything you want.
Duke Professor Kieran Healy sums this up very well in a blog post titled “Two Y-Axes”.
This also goes with the previous point of axes truncation. You can see that by changing axes you can very drastically change interpretations.
Ggplot deliberately makes this hard, as Hadley Wickham is strongly against the practice: the only secondary axis it supports (sec_axis) has to be a one-to-one transformation of the primary axis. (Again, good defaults.)
Say you have a lot of time series data. You might want to compare temporal trends in some measurement for patients in a clinical trial. One natural tendency might be to plot all of their values on the same plot, like below.
x_vals <- 1:50
lots_o_lines <- purrr::map_df(
letters,
~tibble(
x = x_vals,
y = sin(x_vals + rnorm(1))*rnorm(1) + rnorm(50),
id = .
))
#plot with different lines of different letters.
p <- ggplot(lots_o_lines, aes(x = x, y = y))
p +
geom_line(aes(color = id)) +
labs(title = "Delicious Data Spaghetti", x = "time")
Well this is a mess. You really can’t tell what’s going on in any way. If you want to see any trends or potential outliers you better be able to distinguish between the shade of green for k and i, and then be able to filter out all the noise and run 26 choose 2 comparisons in your head.
A way to get around this is using a technique known as small multiples. In small multiples you have a bunch of little tiny charts all with a single data element. So in this case it would be 26 separate line plots with one line each.
p +
geom_line() +
facet_wrap(~id) + #Facet on each line and draw a separate plot for each.
labs(title = "Small multiple lines", x = "time")
As you can see patterns are much easier to see and outliers pop out immediately.
There is another method of dealing with this information overload. Say you have explored your data and want to highlight a single (or maybe two) value in the context of the others. You can highlight that individual line (or whatever graphical element you desire) to call attention to it alone in the chart. This is much more of an explanatory data visualization technique but it does work very well for showing context for an individual element.
library(gghighlight)
p +
geom_line(aes(color = id)) +
# Highlight just a single line
gghighlight(id == 'z')
Sometimes however (are you sensing a theme?) information overload can be used to your advantage.
Take for instance the New York Times delegate prediction model:
In this situation we have a ton of lines, way more than a user can truly parse at one time. Here that is an intentional method of illustrating uncertainty, a topic that could encompass an entire course. If you are interested in the cutting-edge research on uncertainty visualization I suggest you look at the recent work on the topic by Jessica Hullman of the University of Washington.
As a side note I feel this is a specific area in which statisticians should be doing the innovating. We understand uncertainty better than most and, as is evidenced in Jessica Hullman’s post, different things (confidence and credible intervals) are being represented the same way when perhaps they shouldn’t be. If anyone has ideas on this, talk to me!
3d charts are cool and very tempting to make, but they are fraught with all sorts of problems. The main one is that perspective (literally) matters. Just like real life, stuff looks bigger the closer it is, so unless your viewer is going to be viewing your visualization on an Oculus Rift with stereo 3d (I have a visualization like this if you are interested) you should stick to two dimensions. (That being said, per usual, there are some ways around this that are acceptable.)
With that I give you potentially the worst data visualization ever created:
As we already talked about, pie charts are dangerous as slices with different values can look very similar. Once you take that and add in the perspective skewing of the third dimension you get a perfect storm of misleading. I have no suggestions on how to fix this as there are none; it should probably be burned. Just don’t do it. (But later on I will demonstrate an example of when you can use a 3d visualization and be mostly okay.)
While R vs. Python is a heated battle in the statistics community, a much more vitriolic battle is waged on the R sideline over ggplot vs base graphics. Jeff Leek’s aforementioned article, while written in a tone calling for understanding between the two sides, simply ignited passions to hitherto unseen levels.
Ultimately, ggplot has its positives and negatives.
It is pretty dang hard to make a plot with ggplot that looks bad. Base graphics? Pretty easy. This is good as it has helped many people put out better graphics than they otherwise would have.
The “gg” in ggplot stands for grammar of graphics, which is a framework for plotting developed by Leland Wilkinson in his book The Grammar of Graphics. The basic tenets behind this methodology are that you start with your data, then you assign a geometry to elements of that data, such as circle size to population, then you draw those geometries based upon some scaling of your data. When you think about visualization this way it helps you develop a better understanding of the data itself and think of proper ways to visualize it. (Think the bar vs dot chart.)
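As a rough sketch of that sequence, the snippet below walks through the steps with a made-up data frame. The `cities` data, its column names, and the choice of `scale_size_area()` are my own illustration of the circle-size-to-population idea, not something taken from Wilkinson’s book.

```r
library(ggplot2)

# Made-up data: three hypothetical cities on an arbitrary grid
cities <- data.frame(
  name       = c("Alpha", "Beta", "Gamma"),
  x_pos      = c(1, 2, 3),
  y_pos      = c(2, 1, 3),
  population = c(700000, 450000, 150000)
)

# 1. start with the data, 2. map columns to visual properties,
# 3. pick a geometry, 4. choose how values are scaled to circle sizes
p <- ggplot(cities, aes(x = x_pos, y = y_pos, size = population)) +
  geom_point(color = "steelblue") +
  scale_size_area(max_size = 12)
p
```

Each line corresponds to one step of the grammar, so swapping the geometry or the scale means changing one line rather than rewriting the plot.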
Due to the grammar of graphics aspect ggplot is rather intelligible. For instance, while writing a line chart takes more characters of code than it does in base graphics it tends to be much clearer what is going on.
#base graphics
plot(x = df$date, y = df$weight, type = "l", col = "blue")
##ggplot
ggplot(data = df, aes(x = date, y = weight)) + geom_line(color = "blue")
Is col columns? type also seems rather esoteric and would require looking up definitions. In ggplot you can see what x is mapping to, what y is mapping to in your data, geom_line is rather clear that it’s drawing a line and coloring it blue.
This is important for sharing code with potentially less fluent coders.
In addition, to me, the ggplot chart is much more pleasing. The lines in the background facilitate comparisons of values far away from each other, the axes have good names, and there is no harsh boundary around the plot.
It generally takes a good bit of time to construct a ggplot graphic. Base allows you to rapidly get a plot up and running. For instance if you want to check if your simulation is running properly or if something interesting is happening in your data a quick plot(x,y) is usually more than enough. It doesn’t need to look pretty for you.
Want to plot a bunch of different charts on a single plot? With ggplot the charts generated with facet have to be of the same geometry. If you want to put together a line and bar plot you need to use another library called grid, which is a pain, especially considering that it’s a single simple command in base (par(mfrow = c(a,b))).
As said at the beginning of this document, choose your plotting library and then apply the above principles in it. Very rarely will you need to jump to a whole different library to do something. If you do, that’s why stackoverflow was invented.
Believe it or not there are ways to plot data in R other than ggplot and base graphics. The following are a tiny selection of the options. These might not be as fleshed out as ggplot or base but they contain features such as interactivity and larger amounts of automation.
Plotly is a plotting library that allows you to generate interactive plots directly from R. It does this by rendering them in JavaScript (using a technique we will see shortly).
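For reference, the base approach looks roughly like the sketch below, which puts a line chart next to a bar chart. The `visit` and `weight` values are made up for illustration.

```r
# Made-up weights recorded at four visits
visit  <- 1:4
weight <- c(70.2, 69.8, 69.1, 68.5)

old_par <- par(mfrow = c(1, 2))  # 1 row, 2 columns; remember the old settings

plot(visit, weight, type = "l", main = "Line")     # left panel
barplot(weight, names.arg = visit, main = "Bars")  # right panel

par(old_par)  # restore the previous layout
```

Saving and restoring the old settings keeps later plots from silently inheriting the two-panel layout.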
One beautiful thing about plotly is the ability to export ggplot objects directly to it.
library(plotly)
#sample 1000 rows from the diamonds dataset that ships with ggplot2.
d <- diamonds[sample(nrow(diamonds), 1000), ]
#generate a ggplot object
p <- ggplot(data = d, aes(x = carat, y = price)) +
geom_point(aes(text = paste("Clarity:", clarity)), size = 2) +
geom_smooth(aes(colour = cut, fill = cut)) + facet_wrap(~ cut)
#send it to plotly to recreate
ggplotly(p)
Now our normal ggplots can have interactivity which can be absolutely fantastic for exploring outliers/ presenting data in an engaging way.
Plotly is not limited simply to re-rendering ggplot. It is capable of rendering three dimensional and/or high performance visualizations using the same engine that video games use.
In the next example z is a matrix of data corresponding to a two parameter normal likelihood. We pass it to plotly and tell it to draw a surface plot and …
plot_ly(z = resMat, type = "surface")
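The construction of `resMat` is not shown in this post. One way such a matrix could be built is sketched below: it evaluates a normal log-likelihood over a grid of means and standard deviations. The simulated observations, the grid ranges, and the use of the log-likelihood rather than the raw likelihood are all my assumptions.

```r
set.seed(1)  # arbitrary seed for the simulated observations
obs <- rnorm(50, mean = 5, sd = 2)

# Grid of candidate values for the two parameters
mu_grid    <- seq(3, 7, length.out = 60)
sigma_grid <- seq(1, 4, length.out = 60)

# Log-likelihood of the observations for one (mu, sigma) pair
log_lik <- function(mu, sigma) sum(dnorm(obs, mean = mu, sd = sigma, log = TRUE))

# Rows index mu, columns index sigma
resMat <- outer(mu_grid, sigma_grid, Vectorize(log_lik))
dim(resMat)

# With plotly loaded, this matrix can be passed to the call above:
# plot_ly(z = resMat, type = "surface")
```

The surface peaks at the grid cell whose mean is closest to the sample mean, which gives a quick visual sanity check on the matrix.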